Erich Schubert: Java sum-of-array comparisons
This is a follow-up to the
post by Daniel Lemire
on a closely related topic.
Daniel Lemire has experimented with boxing a primitive array in an interface,
and has been trying to measure the cost of doing so.
I must admit I was a bit sceptical about his results, because I have
seen Java successfully inlining code in various situations.
For an experimental library I occasionally work on, I had been
spending quite a bit of time on benchmarking. Previously, I had used
Google Caliper for
it (I even wrote an
evaluation tool for
it to produce better statistics). However, Caliper hasn't seen many updates
recently, and there is now a very attractive similar tool at OpenJDK, too: the
Java Microbenchmark
Harness (JMH) (it can actually be used for benchmarking at other scales, too).
Now that I have experience in both, I must say I consider JMH superior, and
I have switched over my microbenchmarks to it. One of the nice things is that it
doesn't make this distinction of micro vs. macrobenchmarks, and the runtime of
your benchmarks is easier to control.
I largely
recreated his task using JMH. The benchmark task is simple: compute the sum of
an array; the question is what the cost is of allowing data structures other than
double[].
My results, however, are quite different. And the statistics from JMH
indicate that the differences may not be significant, suggesting that Java
manages to inline the code properly.
Benchmark       (size)   Mode  Cnt    Score    Error  Units
adapterFor     1000000  thrpt   50  836,898 ± 13,223  ops/s
adapterForL    1000000  thrpt   50  842,464 ± 11,008  ops/s
adapterForR    1000000  thrpt   50  810,343 ±  9,961  ops/s
adapterWhile   1000000  thrpt   50  839,369 ± 11,705  ops/s
adapterWhileL  1000000  thrpt   50  842,531 ±  9,276  ops/s
boxedFor       1000000  thrpt   50  848,081 ±  7,562  ops/s
boxedForL      1000000  thrpt   50  840,156 ± 12,985  ops/s
boxedForR      1000000  thrpt   50  817,666 ±  9,706  ops/s
boxedWhile     1000000  thrpt   50  845,379 ± 12,761  ops/s
boxedWhileL    1000000  thrpt   50  851,212 ±  7,645  ops/s
forSum         1000000  thrpt   50  845,140 ± 12,500  ops/s
forSumL        1000000  thrpt   50  847,134 ±  9,479  ops/s
forSumL2       1000000  thrpt   50  846,306 ± 13,654  ops/s
forSumR        1000000  thrpt   50  831,139 ± 13,519  ops/s
foreachSum     1000000  thrpt   50  843,023 ± 13,397  ops/s
whileSum       1000000  thrpt   50  848,666 ± 10,723  ops/s
whileSumL      1000000  thrpt   50  847,756 ± 11,191  ops/s
The postfix denotes the iteration type: sum using for loops, with a local variable for the length (L), or in reverse order (R); and while loops (again, with a local variable for the length). The prefix denotes the data layout: the primitive array, the array accessed via a static adapter (the approach I have been using in many implementations in cervidae), and a "boxed" wrapper class around the array (roughly the approach that Daniel Lemire has been investigating). For the primitive array, I also included the foreach loop approach (for (double v : array)).
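The three data layouts can be sketched roughly as follows. This is a hedged illustration, not the actual benchmark code; the class names ArrayAdapter, NumberArray and BoxedArray are invented for this sketch and are not the names used in cervidae or in Daniel Lemire's benchmark:

```java
// Sketch of the three data layouts compared in the benchmark table above.
public class SumLayouts {
    // 1. Plain primitive array, foreach loop (the "foreachSum" variant):
    static double foreachSum(double[] data) {
        double sum = 0.;
        for (double v : data) {
            sum += v;
        }
        return sum;
    }

    // 2. Static adapter: the data stays a double[], but every access
    // goes through static methods of an adapter class.
    static final class ArrayAdapter {
        static int size(double[] data) { return data.length; }
        static double get(double[] data, int i) { return data[i]; }
    }

    static double adapterSum(double[] data) {
        double sum = 0.;
        for (int i = 0, l = ArrayAdapter.size(data); i < l; i++) {
            sum += ArrayAdapter.get(data, i);
        }
        return sum;
    }

    // 3. "Boxed" wrapper: the array is hidden behind an object, and
    // access is an interface method call on that object.
    interface NumberArray {
        int size();
        double get(int i);
    }

    static final class BoxedArray implements NumberArray {
        private final double[] data;
        BoxedArray(double[] data) { this.data = data; }
        public int size() { return data.length; }
        public double get(int i) { return data[i]; }
    }

    static double boxedSum(NumberArray arr) {
        double sum = 0.;
        for (int i = 0, l = arr.size(); i < l; i++) {
            sum += arr.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] data = { 1., 2., 3., 4. };
        System.out.println(foreachSum(data));               // 10.0
        System.out.println(adapterSum(data));               // 10.0
        System.out.println(boxedSum(new BoxedArray(data))); // 10.0
    }
}
```

After warmup, HotSpot can inline the adapter and wrapper accessors, which is why all three layouts end up with essentially the same score in the table above.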
If you look at the standard deviations, the results are pretty much identical,
except for the reverse loops. This is not surprising, given the strong inlining
capabilities of Java: after warmup and HotSpot optimization, all of these variants
will compile to essentially the same CPU code.
I do not have a full explanation for the differences that others have been seeing.
There is no polymorphism occurring here (at runtime) - there is only a single
array implementation in use; but that was the same in his benchmark.
Here is a visualization of the results (sorted by average):
As you can see, most results are indiscernible. The measurement standard deviation is higher than the individual differences. If you run the same benchmark again, you will likely get a different ranking.
Note that performance may - drastically - drop once you use multiple
adapters or boxing classes in the same hot codepath. Java HotSpot keeps statistics
on the receiver classes it sees at each call site, and as long as it sees only one
or two different types, it performs quite aggressive optimizations (such as
inlining) instead of dispatching "virtual" method calls.